Persistence

In Spark, we often use the same RDD multiple times. Without persistence, Spark repeats the whole RDD evaluation each time the RDD is required or an action is invoked on it. This repeated work is time and memory consuming, especially for iterative algorithms that scan the data many times. Persistence solves this problem of repeated computation: it avoids re-computing the whole lineage and, by default, saves the data in memory. This makes the whole system:
  • Time efficient
  • Cost efficient
  • Faster, by reducing execution time.
Persistence is ...
  • Persistence in Spark is an optimization technique that saves the result of an RDD evaluation.
  • Using it, we save an intermediate result so that we can reuse it later if required.
  • It reduces the computation overhead.
  • We can persist an RDD through the cache() and persist() methods.
  • When we use the cache() method, the RDD is stored entirely in memory.
  • We can persist the RDD in memory and use it efficiently across parallel operations.
The difference between cache() and persist() is that cache() always uses the default storage level MEMORY_ONLY, while persist() accepts any of the storage levels described below. When the RDD is computed for the first time, it is kept in memory on the nodes that computed it. Spark's cache is fault tolerant: whenever any partition of a cached RDD is lost, it is recomputed using the transformations that originally created it.
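Here is a minimal sketch showing both methods (assuming sc is an existing SparkContext, for example in spark-shell; the input path is hypothetical):

import org.apache.spark.storage.StorageLevel

// assuming sc is an existing SparkContext (e.g. in spark-shell)
// and that hdfs:///tmp/input.txt is a hypothetical input file
val lines = sc.textFile("hdfs:///tmp/input.txt")

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val cached = lines.map(_.toUpperCase).cache()

// persist() lets us choose a different storage level explicitly
val persisted = lines.filter(_.nonEmpty).persist(StorageLevel.MEMORY_AND_DISK)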

Features of RDD Persistence
  • RDD persistence facilitates storage and reuse of RDD partitions. When an RDD is marked for persistence, every node stores the partitions it computes in memory and reuses them in other actions on that dataset, which makes those later actions faster (see the sketch after this list).
  • Automatic re-computation of lost RDD partitions: If an RDD partition is lost, it is automatically re-computed using the original transformations. Thus, the cache is fault-tolerant.
  • Every persisted RDD can be stored at a different storage level, determined by the StorageLevel object passed to the persist() method.
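As a rough illustration of reuse across actions (assuming sc is available; the transformation here is just a stand-in for something expensive):

// hypothetical expensive transformation; without persist() it would be recomputed for every action
val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x).persist()

squares.count()   // first action: computes the partitions and stores them in memory
squares.sum()     // second action: reuses the stored partitions instead of recomputing them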
MEMORY_ONLY
In this storage level, the RDD is stored as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions are not cached and are recomputed on the fly each time they are needed. At this level the space used for storage is very high, the CPU computation time is low, and the data is stored only in memory; the disk is not used.

MEMORY_AND_DISK
In this level, the RDD is stored as deserialized Java objects in the JVM. When the RDD is larger than memory, the excess partitions are stored on disk and read back from disk whenever required. At this level the space used for storage is high, the CPU computation time is medium, and both memory and disk storage are used.

MEMORY_ONLY_SER
This level stores the RDD as serialized Java objects (one byte array per partition). It is more space efficient than storing deserialized objects, especially when a fast serializer is used, but it adds CPU overhead. At this level the storage space is low, the CPU computation time is high, and the data is stored only in memory; the disk is not used.
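Because this level benefits from a fast serializer, Kryo is a common choice. A minimal sketch of enabling it in a standalone application (not needed in spark-shell, where sc already exists; the application name is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

// use Kryo for serialized data; the config key and class name are Spark's standard ones
val conf = new SparkConf()
  .setAppName("persistence-demo")   // hypothetical application name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)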

MEMORY_AND_DISK_SER
It is similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk rather than being recomputed each time they are needed. At this storage level the space used for storage is low, the CPU computation time is high, and both memory and disk storage are used.

DISK_ONLY
In this storage level, the RDD is stored only on disk. The space used for storage is low, the CPU computation time is high, and only disk storage is used.

Adding Persistence
import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 10)

// an RDD can have only one storage level at a time; pick one of the following
// (to change the level later, unpersist the RDD first)
data.persist(StorageLevel.MEMORY_ONLY)
// data.persist(StorageLevel.MEMORY_AND_DISK)
// data.persist(StorageLevel.MEMORY_ONLY_SER)
// data.persist(StorageLevel.MEMORY_AND_DISK_SER)
// data.persist(StorageLevel.DISK_ONLY)

Unpersist
data.unpersist()
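After unpersisting, a different storage level can be assigned; getStorageLevel shows the level currently attached to the RDD:

data.unpersist()                                  // remove the cached blocks
data.persist(StorageLevel.MEMORY_AND_DISK_SER)    // a new level can be assigned after unpersisting
println(data.getStorageLevel)                     // inspect the level currently assigned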

Which Storage Level to Choose?
Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:
  • If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
  • If not, try MEMORY_ONLY_SER and select a fast serialization library to make the objects much more space efficient, but still reasonably fast to access. (This applies to the Java and Scala APIs.)
  • Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
  • Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition (a short sketch follows this list).
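A short sketch of a replicated level (the _2 suffix means each partition is stored on two nodes):

// each partition is replicated on two nodes, so a lost executor does not force recomputation
val served = sc.parallelize(1 to 10).persist(StorageLevel.MEMORY_ONLY_2)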
